Circuit Breaker Pattern
Let's learn about how circuit breakers help keep our services available.
Introduction #
Circuit breakers are a staple in every modern house. The electricity in our houses comes through the main grid and flows through the circuit breakers. There’s a chance that the grid might behave abnormally once in a while, causing an electrical surge that the wiring of our house might not be able to tolerate. Circuit breakers help us prevent the above scenario, protecting our wiring and appliances by switching them off if they detect an abnormally high amount of power.
The circuit breaker pattern that’s often used in API design functions is quite similar to electrical circuit breakers. It acts as a protective layer of our APIs, preventing them from receiving more requests than they can handle. It’s an effective method for increasing the availability of our services by detecting and potentially preventing cascading failures. It allows us to create a fault-tolerant system that can survive when key services are unavailable.
When deploying a service, clients are often assured that the application will be available 99.999% of the time, allowing for only 0.001% downtime. Let's take a look at some calculations to see exactly what that entails:
A downtime of 0.001% of a year (0.00001 × 525,600 minutes) equals 5.256 minutes. This means that our service will be down for around 5 minutes a year. Now, 5 minutes in a year doesn’t sound like much, but there’s something we haven’t considered yet.
An application can depend on hundreds of microservices to handle all the tasks it needs to carry out. Let’s suppose that our application has 250 microservices, each with a downtime of 5.256 minutes a year. Assuming each of these 250 services can fail at different times, and the failure of any one service means failure of the overall service, the overall downtime will be as follows:

250 services × 5.256 minutes ≈ 1,314 minutes ≈ 21.9 hours of downtime per year
So, with 250 microservices, even at a downtime of merely 0.001% each, we face nearly 22 hours of downtime per year. Such high numbers are not acceptable, and that’s where patterns like circuit breaking shine.
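The arithmetic above can be checked with a short Python sketch. It assumes the 250 services fail independently, which gives almost exactly the same figure as simply multiplying 250 by 5.256 minutes:

```python
# Compound availability: 250 services, each "five nines" (99.999%) available.
# The overall system is up only when every service is up simultaneously.
per_service_uptime = 0.99999
services = 250
minutes_per_year = 365 * 24 * 60          # 525,600 minutes

overall_uptime = per_service_uptime ** services
downtime_minutes = (1 - overall_uptime) * minutes_per_year

print(f"overall uptime:     {overall_uptime:.5f}")
print(f"downtime per year:  {downtime_minutes / 60:.1f} hours")
```

Running this shows the overall uptime dropping to roughly 99.75%, which works out to almost 22 hours of downtime per year.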
The circuit breaker pattern#
The circuit breaker pattern is straightforward and has three states in its life cycle:
Closed: This is considered to be the normal state. In the closed state, all the calls being made to the service pass through, and the number of failures is monitored.
Open: If the microservice experiences slowness, requests to it begin to fail, and the circuit breaker records those failures. Once the number of failures passes a certain limit, the breaker goes into the open state and activates a timeout for the service. In this state, any requests sent to the microservice are stopped by the circuit breaker immediately. Instead, the circuit breaker can respond to the client, informing it of the timeout and, if an alternative instance of the service is available, advising the client to send its requests there. The client can then either redirect to a different instance of the service or wait until the service is back online. This gives the microservice some time to recover by eliminating the load it has to handle. Once the timeout has expired, the breaker passes to the half-open state.
Note: When the circuit breaker is in the open state, it prevents requests from reaching the microservice. This is not a failure of the service but rather an interception by the circuit breaker to give the downed microservice time to recover. Without this intervention, microservices can fail in a cascading fashion, and rebooting such a complex system is often time-consuming. Our overall service should be designed so that if a few dependent services are temporarily unavailable, the overall service degrades gracefully (instead of suffering a total outage).
Half-open: In this state, a limited number of requests are allowed through to the microservice to test whether the underlying issue still persists. If even a single one of these calls fails, the breaker trips and goes back to the open state. However, if all the calls succeed, the breaker resets to the closed state and resumes operating as normal.
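The three-state life cycle can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production implementation: for brevity it trips on a count of consecutive failures rather than a failure rate, and the threshold and timeout values are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal sketch of the closed / open / half-open life cycle."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, half_open_probes=2):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.half_open_probes = half_open_probes    # trial calls in half-open
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_left = 0

    def call(self, func):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"            # timeout expired: allow probes
                self.probes_left = self.half_open_probes
            else:
                raise RuntimeError("circuit open: request rejected")
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()                            # any half-open failure re-opens
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _on_success(self):
        if self.state == "half_open":
            self.probes_left -= 1
            if self.probes_left <= 0:
                self.state = "closed"               # all probes passed: reset
                self.failures = 0
        elif self.state == "closed":
            self.failures = 0

    def _trip(self):
        self.state = "open"
        self.opened_at = time.monotonic()
        self.failures = 0
```

A caller wraps each outbound request in `breaker.call(...)`: in the closed state the call passes through, in the open state it is rejected immediately, and in the half-open state the first few calls act as probes that decide whether the breaker resets or trips again.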
Note: A request is considered failed if the microservice can't respond within the time limit defined for that service. A rejection by the circuit breaker is not counted as a failure of the service. In the half-open state, the circuit breaker rejects some requests itself and allows a limited number through to the microservice to test its functionality.
Here, we depict the lifecycle of a circuit breaker and the events that trigger its different states:
Now that we know what circuit breakers are and how they work, let's take a look at some scenarios where we can take advantage of circuit breakers.
Example scenario#
Let's suppose we have an application with five different services. When a service gets a request, the application allocates a process to call that service. Any of these services may fail for a variety of reasons, such as high latency. This is especially problematic for a high-demand service: because it receives more requests, more processes are allocated to it, and all of those processes block while waiting for the service to respond.
Now, if the majority of our processes are occupied by this one service, that would leave only a few processes for the other services. This leads to the possibility of the leftover processes being occupied by the remaining services and, in turn, blocking all the processes of the application. The requests, however, will not stop coming and will add up until the processes are unblocked. Even after the service recovers, the processes will be busy processing the requests that queued up while the service was unavailable. Before long, it might lead to cascading failures throughout the application.
The scenario above has been illustrated in the following slides:
A scenario like this is perfect for demonstrating the utility of the circuit breaker pattern. First, we’ll have to define a failure threshold for our services. For our case, let's assume it to be 300 ms. That is, if a service is taking longer than 300 ms to respond, we’ll consider it to have failed.
Let's go through this process and see how adding a circuit breaker affects our scenario:
Normally, the circuit breaker is in the closed state, and all requests go through to the service.
If a significant share of these requests, let’s say 50%, exceeds the failure threshold we defined previously (taking longer than 300 ms to be served), the breaker assumes the service is unresponsive, “trips,” and goes into the open state. The breaker then sends a message to the client, which can either wait for the service to become responsive again or redirect to another instance. This prevents requests from queuing up and frees up processes because they’re no longer blocked by an unresponsive service.
After the timeout expires, the breaker will move to the half-open state and allow a fraction of the total requests to go through to its corresponding service.
Let's say that the service normally gets 100 requests per second. In the half-open state, the circuit breaker would allow 25 of these requests to pass to the service. If all these requests succeed, then the breaker assumes the service is functioning properly and moves to the closed state so that the service can carry on as usual. However, if even one of the requests in the half-open state fails, then the breaker reverts to the open state and the timeout begins again.
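The tripping condition in these steps can be sketched as a sliding-window failure-rate check. The 300 ms threshold and the 50% limit come from the example above; the window size of 100 calls is an assumption:

```python
from collections import deque

# Classify each call by latency and trip once the recent failure rate
# reaches 50%. Threshold and rate come from the example in the text;
# the 100-call window is an illustrative assumption.
FAILURE_THRESHOLD_MS = 300
FAILURE_RATE_LIMIT = 0.5
WINDOW_SIZE = 100                     # most recent calls to consider

window = deque(maxlen=WINDOW_SIZE)    # True = failed (too-slow) call

def record_call(latency_ms: float) -> bool:
    """Record one call's latency; return True if the breaker should trip."""
    window.append(latency_ms > FAILURE_THRESHOLD_MS)
    if len(window) < WINDOW_SIZE:     # not enough data to judge yet
        return False
    return sum(window) / len(window) >= FAILURE_RATE_LIMIT

# Simulate 60 fast calls followed by 60 slow ones: the breaker trips
# once slow calls make up half of the window.
tripped_at = None
latencies = [50] * 60 + [450] * 60
for i, ms in enumerate(latencies):
    if record_call(ms):
        tripped_at = i
        break
print("tripped at call", tripped_at)
```

The deque automatically discards the oldest sample, so the failure rate always reflects only the most recent calls rather than the service's entire history.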
This process has been illustrated by the slides below:
Note: In the slides above, we have abbreviated "circuit breaker" to CB.
In the slides above, we have a system with seven processes calling the available services. When 50% of the requests going to Service A end in failure, the circuit breaker attached to it goes into the open state and activates a timeout period for the service, and any subsequent requests to that service will immediately end in failure. Immediate failure frees up the processes to take care of other requests while giving time to Service A to recover.
After the timeout of the open state is over, the circuit breaker switches to the half-open state. In this state, the circuit breaker allows a few requests through to the service; in our case, two out of three requests go through. Because all of them succeed, the circuit breaker reverts to the closed state, and Service A returns to functioning normally.
Cascading failures scenario#
A cascading failure happens when the failure of one service affects the performance of the services dependent on it, causing multiple services in the system to fail.
In the slides above, we have a system of interconnected services. If one fails, the rest are at risk of failure as well, causing a cascading failure of the system. The failed service cannot respond to requests, so the dependent services wait too long for a response and eventually fail themselves.
The circuit breaker pattern helps here by adding an extra layer that calls between services have to pass through. This allows calls to an unresponsive service to fail fast and protects the calling clients from failing in turn.
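The fail-fast behavior can be illustrated with a tiny sketch in which a caller returns a fallback response instead of blocking on a hung dependency. The function names and the cached fallback are hypothetical:

```python
import time

def call_service_6():
    """Stub for an unresponsive dependency (hypothetical name)."""
    time.sleep(5)              # simulates a hung remote call
    return {"items": ["a", "b"], "source": "service-6"}

circuit_open = True            # the breaker for Service 6 has tripped

def fetch_recommendations():
    if circuit_open:
        # Fail fast: skip the hung dependency and degrade gracefully
        # instead of tying up a process for the full 5 seconds.
        return {"items": [], "source": "fallback-cache"}
    return call_service_6()

result = fetch_recommendations()
print(result["source"])
```

Because the breaker is open, the call returns immediately from the fallback branch; the caller's process is free to serve other requests while Service 6 recovers.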
Let's visualize how circuit breakers can help us fix or mitigate the issue of cascading failures:
In the slides above, we have a system of interconnected services, each with its own circuit breaker. If Service 6 crosses the failure-rate threshold, the circuit breaker attached to it switches to the open state. As a result, any requests to Service 6 are stopped by the circuit breaker, which sends a response to the requesting clients, allowing them either to wait or to be redirected to another instance of the service. This contains the failure to one service because the other services immediately receive failure responses and can move on to other tasks.
As mentioned in the previous section, the open-state timeout of the circuit breaker gives the service time to recover. Once the timeout is over, the breaker goes into the half-open state. In this state, a fraction of the usual requests (in our case, two out of three) are sent to the service. If all of them succeed, the circuit breaker goes into the closed state, and the service starts functioning normally again.
Netflix uses this pattern in their product to make it more resilient and fault tolerant.
They developed the Hystrix framework, which is based on the circuit breaker pattern we have been studying in this lesson. It’s open source and can be added to existing applications fairly easily (note that Hystrix is now in maintenance mode, and Netflix recommends newer libraries such as Resilience4j). If you’re interested, it’s worth checking out for more details, but it’s beyond the scope of this course.
Summary #
In this lesson, we learned that the circuit breaker pattern is an extremely useful technique to include in our API design strategies. The purpose of this pattern is to allow services to "fail fast and recover ASAP." This pattern will help us in building resilient systems that can protect our system from cascading failures and protect our client processes from being blocked due to having to wait for a response from a failed service.
Quiz
Question
On the surface, it might seem like circuit breakers have a similar function to rate limiters. Why should we use one over the other?
While they may seem to be similar on the surface because they’re both used to limit calls to an API or a service, upon further exploration, we can see that they have two different use cases.
In general, rate limiters are not very complex. They restrict the calls being made to a service in a specific period of time. They do this by monitoring the number of calls being made to a service. If the number of calls in the specified time frame exceeds the limit set by the system, then the rate limiter begins a timeout, during which any calls to the service are dropped. Once the timeout is over, the service returns to normal functioning. The rate limiter essentially helps us protect the server from overloading by controlling the throughput.
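For contrast, here is a sketch of a fixed-window rate limiter. Note that it only counts calls per time window; unlike a circuit breaker, it knows nothing about whether those calls succeed or fail:

```python
import time

class RateLimiter:
    """Fixed-window rate limiter sketch: caps calls per time window."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now       # new window: reset the counter
            self.count = 0
        if self.count < self.max_calls:
            self.count += 1
            return True
        return False                      # over the limit: drop the call

limiter = RateLimiter(max_calls=3, window_seconds=1.0)
results = [limiter.allow() for _ in range(5)]
print(results)   # first three calls allowed, the rest dropped
```

Nothing in `allow()` inspects latencies or errors, which is exactly the difference discussed below: the limiter controls throughput, while a circuit breaker reacts to the health of the downstream service.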
One of the main differences between circuit breakers and rate limiters is that circuit breakers ensure a failure remains isolated to one component; they keep the client safe when the target service is unresponsive. Circuit breakers are smarter and more resilient than rate limiters because they detect failures and shut off access to the failed service, while rate limiters do no such thing. Therefore, circuit breakers are preferred in more complex systems, such as ones with cascading dependencies.
Circuit breakers are also concerned with the health of the service and the half-open state of the circuit breaker is there to check if the service is healthy enough to function properly. On the other hand, the rate limiter is not concerned with the health of the service. It only limits the number of requests made to a service.